Introduction:

This project analyses the red wine dataset. The main aim is to understand which variables in the dataset affect the quality of the wine. To find out, we perform Exploratory Data Analysis (EDA) on the dataset: univariate, bivariate and multivariate analysis of the variables.

## [1] "C:/udacity"

The data has been loaded into the data frame redWineData. We run the str function on it to view the variables present; the last line of the output gives the dimensions of the data frame.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599   13

Observations regarding the dataset:

  • There are 1599 rows and 13 columns (variables) in the data frame.
  • We need to understand which of the variables have an impact on the quality variable.
  • Input variables (based on physicochemical tests) and their units:

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

  • Descriptions of the variables:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Two new variables are created: rating, which categorises the wines by their quality score, and total_acidity, which is the sum of fixed and volatile acidity.
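A minimal sketch of how these two columns could be derived in base R, using only the first ten rows shown in the str output. The data frame name wine is hypothetical, and the cut points (4 and below as bad, 5 to 6 as average, 7 and above as good) are an assumption inferred from the rating codes and category counts in the output that follows.

```r
# First ten rows of the relevant columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# total_acidity is the sum of fixed and volatile acidity.
wine$total_acidity <- wine$fixed.acidity + wine$volatile.acidity

# rating is an ordered factor. The break points are an assumption:
# (0, 4] -> bad, (4, 6] -> average, (6, 10] -> good.
wine$rating <- cut(
  wine$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "average", "good"),
  ordered_result = TRUE
)

# Inspect the result; both new columns should now be listed.
str(wine)
```

With these break points the first ten wines map to the same rating codes as in the str output of the full data frame.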

Check that the newly added columns are present in the data frame:

## 'data.frame':    1599 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ rating              : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
##  $ total_acidity       : num  8.1 8.68 8.56 11.48 8.1 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality          rating     total_acidity   
##  Min.   : 8.40   Min.   :3.000   bad    :  63   Min.   : 5.120  
##  1st Qu.: 9.50   1st Qu.:5.000   average:1319   1st Qu.: 7.680  
##  Median :10.20   Median :6.000   good   : 217   Median : 8.445  
##  Mean   :10.42   Mean   :5.636                  Mean   : 8.847  
##  3rd Qu.:11.10   3rd Qu.:6.000                  3rd Qu.: 9.740  
##  Max.   :14.90   Max.   :8.000                  Max.   :16.285

Univariate Plot Section

The individual variables will be analysed before finding their impact on the quality of wine. This will help us understand the nature of each variable.

Import all the required libraries

We start with the distribution of the quality variable. Below are the summary statistics and plot for quality:

##    vars    n mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 1599 5.64 0.81      6    5.59 1.48   3   8     5 0.22     0.29
##      se
## X1 0.02
##      nbr.val     nbr.null       nbr.na          min          max 
## 1.599000e+03 0.000000e+00 0.000000e+00 3.000000e+00 8.000000e+00 
##        range          sum       median         mean      SE.mean 
## 5.000000e+00 9.012000e+03 6.000000e+00 5.636023e+00 2.019555e-02 
## CI.mean.0.95          var      std.dev     coef.var 
## 3.961255e-02 6.521684e-01 8.075694e-01 1.432871e-01

The first table above is the output of describe(redWineData$quality); the second is the output of stat.desc(redWineData$quality).

From the above statistics and plot it is noted that most observations have a quality score of 5 to 7. The scores range from 3 to 8, so there are no observations with a score of 10.
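The reported figures can be cross-checked with a short base R sketch. The per-score counts below are an assumption taken from the public red wine dataset rather than from this report; they are consistent with the number of observations, sum, mean and variance reported above. The vector name quality is illustrative.

```r
# Assumed number of wines per quality score (3 to 8).
scores <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)

# Rebuild the quality vector from the counts.
quality <- rep(scores, times = counts)

# Basic statistics, comparable to the describe() and stat.desc() output.
print(length(quality))
print(sum(quality))
print(mean(quality))
print(var(quality))
print(sd(quality))

# Frequency of each score and the share of wines scoring 5 to 7.
print(table(quality))
print(mean(quality >= 5 & quality <= 7))
```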

Plots for the other variables follow.

##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1599 8.32 1.74    7.9    8.15 1.48 4.6 15.9  11.3 0.98     1.12
##      se
## X1 0.04

The plot shows that fixed acidity has a roughly normal distribution with a right tail (skew 0.98). The mean value of fixed acidity in the dataset is 8.32.

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1599 0.53 0.18   0.52    0.52 0.18 0.12 1.58  1.46 0.67     1.21
##    se
## X1  0

The plot for volatile acidity shows characteristics similar to those of fixed acidity, with a roughly normal distribution. The mean for this variable is 0.53.

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1599 3.31 0.15   3.31    3.31 0.15 2.74 4.01  1.27 0.19      0.8
##    se
## X1  0

The plot shows a bit of right skewness. The mean of the pH distribution is at 3.31.

##    vars    n mean   sd median trimmed  mad min max range skew kurtosis se
## X1    1 1599 0.27 0.19   0.26    0.26 0.25   0   1     1 0.32    -0.79  0

The graph also shows a right-sided tail. The mean of the citric acid distribution is 0.27. Zero is the most common single value for citric acid, as the spike at 0 is the highest in the plot, although the median is 0.26.

##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1599 2.54 1.41    2.2    2.26 0.44 0.9 15.5  14.6 4.53    28.49
##      se
## X1 0.04

The residual sugar distribution is highly right skewed. The mean for the distribution is 2.54.

##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1599 0.09 0.05   0.08    0.08 0.01 0.01 0.61   0.6 5.67    41.53
##    se
## X1  0

The chlorides distribution shows skewness similar to that of residual sugar. Most observations have a value below 0.1.

##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 1599 15.87 10.46     14   14.58 10.38   1  72    71 1.25     2.01
##      se
## X1 0.26

The free sulfur dioxide distribution is right skewed with a mean of 15.87.

##    vars    n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 1599 46.47 32.9     38   41.84 26.69   6 289   283 1.51     3.79
##      se
## X1 0.82

The total sulfur dioxide distribution is right skewed with a mean of 46.47.

##    vars    n mean   sd median trimmed  mad  min max range skew kurtosis se
## X1    1 1599 0.66 0.17   0.62    0.64 0.12 0.33   2  1.67 2.42    11.66  0

The sulphate distribution is right skewed with outliers and mean of 0.66.

##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 1599    1  0      1       1   0 0.99   1  0.01 0.07     0.92  0

The density distribution is approximately normal, with a mean of about 1 (0.9967 before rounding). The median and mean are practically the same for this distribution.

##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1599 10.42 1.07   10.2   10.31 1.04 8.4 14.9   6.5 0.86     0.19
##      se
## X1 0.03

The alcohol distribution shows that for the majority of the observations the alcohol percentage is between 9 and 11. The mean is 10.42.

##    vars    n mean  sd median trimmed  mad  min   max range skew kurtosis
## X1    1 1599 8.85 1.7   8.45    8.69 1.42 5.12 16.29 11.17 0.97     1.23
##      se
## X1 0.04

Total acidity is approximately normally distributed with a right tail (skew 0.97); the mean is 8.85.

From the plot of rating it can be observed that the majority of the observations (1319 of 1599) fall under the average rating category. Very few observations fall under the good and bad categories, which makes it difficult to find the variables that affect the quality of the wine.

Univariate Analysis

Observations in the univariate plots:

  • The variables alcohol, density, pH, fixed acidity, volatile acidity and citric acid are approximately normally distributed, with skewness below 1 (from 0.07 for density to 0.98 for fixed acidity).
  • The variables sulphates, total sulfur dioxide and free sulfur dioxide are moderately positively skewed (skewness between 1.25 and 2.42).
  • The variables chlorides and residual sugar are highly positively skewed (skewness 5.67 and 4.53), with extreme outliers.
  • The citric acid variable has a large number of zero values.
  • Most of the quality data falls in the average category (scores 5 and 6); there are few observations for good (7 and above) and bad (4 and below) wines.
  • For most observations the alcohol value is between 9 and 11.
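The skewness values quoted in these observations come from the describe output. As a rough illustration of what that number measures, here is a minimal base R sketch of the moment-based skewness coefficient. The helper name skewness and the two example vectors are hypothetical, and packages such as psych may apply a small-sample adjustment, so their values can differ slightly from this formula.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  m2 <- mean(deviations^2)
  m3 <- mean(deviations^3)
  m3 / m2^1.5
}

# Illustrative vectors (not taken from the wine data).
symmetric    <- c(1, 2, 3, 4, 5)   # no tail on either side
right_tailed <- c(0, 0, 0, 4)      # one large value pulls the tail right

print(skewness(symmetric))
print(skewness(right_tailed))
```

A value near 0 indicates a symmetric distribution, while a clearly positive value indicates a right tail, as seen for chlorides and residual sugar.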

What is the structure of your dataset?

The dataset has 13 variables and 1599 observations. Apart from the row index X, the variables in the dataset are:

fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide
total.sulfur.dioxide
density
pH
sulphates
alcohol
quality

What is/are the main feature(s) of interest in your dataset?

The variable of interest is quality. We want to study the variables that have an impact on the quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The expectation is that citric acid, pH, residual sugar, alcohol and total acidity will help explain the quality of wine. These factors contribute to the taste of wine, which determines its quality, so they may well have an impact on quality.

Did you create any new variables from existing variables in the dataset?

Yes, two new variables have been created. The first, total_acidity, is the sum of volatile acidity and fixed acidity, as these two variables together determine the acidity of the wine. The second, rating, categorises the wines by their quality score into bad, average and good.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • The variables alcohol, density, pH, fixed acidity, volatile acidity and citric acid are approximately normally distributed, with skewness below 1.
  • The variables sulphates, total sulfur dioxide and free sulfur dioxide are moderately positively skewed.
  • The variables chlorides and residual sugar are highly positively skewed, with extreme outliers.
  • The citric acid variable has a large number of zero values.

Limits have been set on the x axis and y axis to give a closer view of the data. Plots with and without outliers have been drawn to understand the distribution of the data.

Bivariate Plot Section

We will first plot a scatterplot matrix to understand the relation between pairs of variables.

We will first look at the correlation coefficients of all the variables with quality.

In the plot above it can be seen that:

  • Citric acid has a positive correlation with quality.
  • Volatile acidity has a negative correlation with quality.
  • Residual sugar, fixed acidity and chlorides have a weak relation with the quality variable.
  • Citric acid has a strong relation with fixed and volatile acidity.

In the plot above it can be seen that:

  • Alcohol has a positive correlation with quality, and the strongest relation with it.
  • Total sulfur dioxide has a strong relation with free sulfur dioxide.
  • Sulphates have a positive correlation with quality.
  • pH, total acidity and sulfur dioxide have a weak relation with the quality variable.

Observations from the scatterplot matrix:

  • There are no variables that have a strong correlation with quality.
  • Comparing the correlation coefficients of all variables with quality, the following seem to have some relation with quality: alcohol (0.476), volatile acidity (-0.391), sulphates (0.251) and citric acid (0.226).
  • There is a strong correlation between citric acid and both fixed and volatile acidity.
  • There is a strong correlation between total sulfur dioxide and free sulfur dioxide, and between pH and total acidity.
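A minimal sketch of how correlation coefficients with quality can be computed in base R. It uses only the first ten rows printed in the str output, so the values it prints will not match the coefficients for the full 1599 observations; the data frame name wine is hypothetical.

```r
# First ten rows of selected columns, copied from the str() output.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every predictor column with quality.
predictors <- setdiff(names(wine), "quality")
correlations <- sapply(
  wine[predictors],
  function(column) cor(column, wine$quality)
)

# Sort so the strongest positive relation comes first.
print(round(sort(correlations, decreasing = TRUE), 3))
```

Applied to the full data frame, the same call pattern would give one coefficient per variable, which can then be ranked by absolute value.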

Below we plot each variable against quality, with quality fixed on the y-axis, to understand its effect on the quality of wine.

From the above plot it can be seen that there is a positive relation between alcohol and quality, the strongest among the variables examined.

The volatile acidity has a negative relationship with quality.

The above plot shows that there is no strong relation between residual sugar and quality.

The above plot shows there is no strong relation between pH and quality.

The above plot shows there is a positive relation between sulphates and quality.

There exists a positive relation between citric acid and quality.

Now we will plot graphs for other variables to understand their relationships.

There exists a strong negative relationship between citric acid and volatile acidity.

There exists a strong positive relationship between fixed acidity and citric acid.

There exists a negative relationship between total sulfur dioxide and quality.

There exists no strong relationship between free sulfur dioxide and quality.

There exists a weak negative relationship between density and quality.

Now that a relation has been observed between four variables (alcohol, volatile acidity, citric acid and sulphates) and quality, we will draw box plots showing the values of these variables for each rating category.

From the above plots it is observed that good quality wines have a high alcohol content. Volatile acidity is lower for good quality wines. The citric acid content is somewhat higher in good quality wines than in bad and average ones. For most observations, sulphates lie between 0.5 and 0.8.
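A minimal base R sketch of this kind of per-rating comparison, again limited to the first ten rows from the str output (which contain no bad wines). The data frame name wine is hypothetical, and the report's own plots may have been produced with a different plotting library.

```r
# First ten rows of the relevant columns; rating codes 2 and 3 from the
# str() output correspond to "average" and "good".
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  rating = factor(
    c(rep("average", 7), "good", "good", "average"),
    levels = c("bad", "average", "good"),
    ordered = TRUE
  )
)

# Median of each variable within each rating category.
medians <- aggregate(
  cbind(alcohol, volatile.acidity) ~ rating,
  data = wine,
  FUN = median
)
print(medians)

# Box plot of alcohol by rating, drawn to a null device so that no
# file is written when the script runs non-interactively.
pdf(NULL)
boxplot(alcohol ~ rating, data = wine, ylab = "alcohol (% by volume)")
invisible(dev.off())
```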

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • No variable displays a strong relationship with quality. Still, there is some relationship between quality and alcohol, sulphates, volatile acidity and citric acid. A strong relation is also observed between citric acid and both fixed and volatile acidity. Better wines seem to have a higher concentration of citric acid and higher alcohol percentages. Residual sugar has no impact on quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It has been observed that there is a strong relation between pH and total acidity. Also there has been a strong relation observed between citric acid and fixed acidity and citric acid and volatile acidity.

What was the strongest relationship you found?

Relative to quality, alcohol had the strongest relation. Among the other variables, citric acid and fixed acidity are strongly related.

Multivariate Plot section

Now we will plot multiple variable plots to conclude on the factors that impact wine quality.

We have seen that alcohol has the strongest relation with quality, so we will plot different variables against alcohol and quality to see whether any of them, together with alcohol, affect the quality of wine.

There is a strong negative relationship between alcohol and density.

In this plot, high quality wines appear where the sulphates amount is low and the alcohol amount is high.

Residual sugar has a weak relationship with alcohol.

From the above plot we can see a positive relationship between pH and alcohol.

Now we will plot graphs by fixing the acidity, this will help us to understand relationships of other variables apart from quality

It can be seen that citric acid and fixed acidity have a strong relationship.

It can be seen that residual sugar does not have a strong relation with fixed acidity.

It can be noted in the above plot that density and fixed acidity when low produce wines with quality score of 8.

Observations:

  • Quality is high when volatile acidity and density are low.
  • Quality gets higher with more alcohol and less sulphates.
  • Wine has good quality when the amount of alcohol is high and volatile acidity is low.
  • Density has the weakest correlation with quality.
  • Residual sugar has no impact on quality.

Plotting a graph based on the above observation

From the above plots it can be noted that good quality wines appear where the amounts of chlorides, sulphates, volatile acidity and citric acid are low and the amount of alcohol is high.

It can be seen that high density is associated with bad quality wines.

Residual sugar has no visible impact on quality.

Linear Model

We will fit a linear model based on the data analysed so far:

plt1 <- lm(quality ~ alcohol, data = redWineData)
summary(plt1)

## 
## Call:
## lm(formula = quality ~ alcohol, data = redWineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87497    0.17471   10.73   <2e-16 ***
## alcohol      0.36084    0.01668   21.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16
## 
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = redWineData)
## m2: lm(formula = quality ~ alcohol + citric.acid, data = redWineData)
## m3: lm(formula = quality ~ alcohol + citric.acid + chlorides, data = redWineData)
## m4: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar, 
##     data = redWineData)
## m5: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar + 
##     total_acidity, data = redWineData)
## m6: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar + 
##     total_acidity + sulphates, data = redWineData)
## 
## ======================================================================================================
##                        m1            m2            m3            m4            m5            m6       
## ------------------------------------------------------------------------------------------------------
##   (Intercept)         1.875***      1.830***      2.056***      2.085***      2.000***      1.719***  
##                      (0.175)       (0.171)       (0.186)       (0.187)       (0.233)       (0.228)    
##   alcohol             0.361***      0.346***      0.333***      0.334***      0.336***      0.311***  
##                      (0.017)       (0.016)       (0.017)       (0.017)       (0.017)       (0.017)    
##   citric.acid                       0.730***      0.798***      0.814***      0.767***      0.549***  
##                                    (0.090)       (0.092)       (0.093)       (0.121)       (0.120)    
##   chlorides                                      -1.218**      -1.200**      -1.179**      -2.564***  
##                                                  (0.389)       (0.390)       (0.391)       (0.408)    
##   residual.sugar                                               -0.017        -0.017        -0.010     
##                                                                (0.012)       (0.012)       (0.012)    
##   total_acidity                                                               0.008         0.009     
##                                                                              (0.013)       (0.013)    
##   sulphates                                                                                 1.068***  
##                                                                                            (0.113)    
## ------------------------------------------------------------------------------------------------------
##   R-squared           0.227         0.257         0.262         0.263         0.263         0.302     
##   adj. R-squared      0.226         0.256         0.261         0.261         0.261         0.300     
##   sigma               0.710         0.696         0.694         0.694         0.694         0.676     
##   F                 468.267       276.595       188.675       142.024       113.650       114.880     
##   p                   0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood  -1721.057     -1688.711     -1683.819     -1682.921     -1682.733     -1639.019     
##   Deviance          805.870       773.917       769.196       768.333       768.153       727.280     
##   AIC              3448.114      3385.421      3377.637      3377.842      3379.467      3294.037     
##   BIC              3464.245      3406.930      3404.523      3410.105      3417.106      3337.054     
##   N                1599          1599          1599          1599          1599          1599         
## ======================================================================================================

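The table above shows the intercept and coefficient estimates (standard errors in parentheses) for six nested linear models. R-squared rises from 0.227 with alcohol alone (m1) to 0.302 when citric acid, chlorides, residual sugar, total acidity and sulphates are added (m6). The coefficients for residual sugar and total acidity are not statistically significant.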
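To make the m1 coefficients concrete, here is a small worked sketch in base R. It plugs the reported intercept and slope into the fitted line; the helper name predict_quality_m1 is hypothetical, and the inputs are the rounded values printed in this report.

```r
# Coefficients of model m1 as reported in the summary output.
intercept     <- 1.87497
slope_alcohol <- 0.36084

# Hypothetical helper: predicted quality for a given alcohol percentage.
predict_quality_m1 <- function(alcohol) {
  intercept + slope_alcohol * alcohol
}

# Prediction at the mean alcohol content (10.42) reported earlier;
# a least-squares line passes through the point of means.
pred_at_mean <- predict_quality_m1(10.42)
print(pred_at_mean)

# One additional percentage point of alcohol changes the prediction
# by exactly the slope.
print(predict_quality_m1(11.42) - predict_quality_m1(10.42))

# With a single predictor, R-squared equals the squared correlation
# between alcohol and quality (0.476 in the bivariate section).
r_alcohol <- 0.476
print(round(r_alcohol^2, 3))
```

The prediction at the mean alcohol content lands close to the mean quality score of 5.64, and the squared correlation agrees with the R-squared reported for m1.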

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

A few of the relationships identified in the bivariate analysis, when combined, had an impact on the quality of wine. The observations are:

  • Quality is high when volatile acidity and density are low.
  • Quality gets higher with more alcohol and less sulphates.
  • Wine has good quality when the amount of alcohol is high and volatile acidity is low.

There were also a few variables that, when added alongside alcohol, showed no impact on quality:

  • Density has the weakest correlation with quality.
  • Residual sugar has no impact on quality.

Were there any interesting or surprising interactions between features?

Earlier it was assumed that pH and citric acid would have a great impact on the quality of the wine, but these variables turned out not to have a significant impact. There were variables, such as volatile acidity and sulphates, that are associated with good quality wines when present in small amounts.

Final Plots and Summary

Plot 1

Plot 1 description

It can be noted that the dataset mostly contains average quality wines. There are very few observations for good and bad quality wines. This constraint makes it difficult to determine the factors that affect the quality of wine. Very few observations have a quality score of 0-4 or 7-10, while 1319 of the 1599 records fall in the average category (scores 5 and 6).

Plot 2

Plot 2 description

The above plot shows that alcohol, sulphates, volatile acidity and citric acid have the strongest correlations with quality. From the plot it can be noted that the mean alcohol percentage for good quality wines is approximately 11%. Volatile acidity is lower (mean of 0.4) in good quality wines than in bad and average quality wines. The mean of citric acid in good quality wines (about 0.375) is higher than in bad and average quality wines. Sulphates are present in small amounts in all three categories of wine. The amounts of these four variables can therefore have an impact on the quality of the wine.

Plot 3

Plot 3 description

It is observed that we can get good quality wine when the volatile acidity and sulphates amounts are low and the alcohol content is high. There is no impact of density and pH on the quality of wine.

Reflection

Before beginning the bivariate and multivariate analysis, it was thought that pH and density would play a major role in the quality of wine. Only alcohol was expected to matter before the analysis and was confirmed to matter after it.

After the analysis it was found that a high amount of alcohol and low amounts of sulphates and volatile acidity are associated with good quality wines.

For future work, if a dataset with more good and bad rated wines is procured, the variables affecting quality can be better determined.